This project began in June 2022 as an opportunity to learn web scraping with R inside the RStudio (now Posit) IDE. The project quickly grew as I saw opportunities to learn and practice mapping, time series machine learning forecasts, and much more: data wrangling, plotting and visualization, statistical testing, geocoding, and further web scraping with Python when I ran into a roadblock with an R package and a finicky web page.
Despite the heavy nature of the material, this has been a fun and interesting project, and I have learned so much from the R community that I hope I can begin to repay that debt with this project. I think some part of what follows will make an analyst's or data scientist's life a little easier; that has been my experience with R and the R community: free and open software and generous, widespread sharing of knowledge.
The beauty of R is that I will share this analysis with everyone, and, if anyone is skeptical or curious, which people should be, they can get the same data and challenge my findings and conclusions with their own analysis. I invite you to do so, if you are so inclined. Please feel free to contact me if you would like to chat as I invite all constructive criticism.
What follows is an analysis of mass shooting phenomena in the US.
Guns. Gun rights. Gun violence. These are highly contentious and controversial issues, and opinions and beliefs about them are shaped by a host of individual and cultural influences.
The intent of this analysis is purely scientific; it has neither a political motive nor a personal agenda beyond informing people about the state of mass shooting gun violence in the United States of America.
An open and scientific mind was applied to these data both as a learning experience for the author and an informative experience for the public.
Gun violence is a pervasive issue in American culture. This analysis focuses on mass shootings in particular, and it does not include all types of US gun-related incidents.
The data used in this project come from the Gun Violence Archive (GVA); the original sources for each incident are news media and law enforcement reports. The GVA can be found at https://www.gunviolencearchive.org. Special thanks are due to the GVA for recording and documenting these data and sharing them with the public.
A mass shooting is defined by the Gun Violence Archive as a shooting in which four or more people are injured or killed, not including the perpetrator(s) of the shooting.
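With the data in a table, that definition amounts to a simple filter on victim counts. Here is a minimal sketch, assuming the number_killed and number_injured columns that appear in the scraped data below (the GVA tables already apply this definition, so this is purely illustrative):

```r
library(dplyr)

# toy incident table with victim counts per incident
incidents <- tibble::tibble(
  incident_id    = 1:3,
  number_killed  = c(0, 2, 1),
  number_injured = c(5, 2, 1)
)

# GVA definition: four or more people injured or killed, excluding the perpetrator(s)
incidents %>%
  filter(number_killed + number_injured >= 4)
```

Rows 1 and 2 qualify (5 and 4 total victims respectively); row 3 does not.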
I approached this project in three frames: exploratory data analysis, mapping, and time series forecasting.
We begin by loading in libraries that help with the data handling and analytical process.
library(tidyverse)
library(tidymodels)
library(tidytext)
library(lubridate)
library(scales)
library(janitor)
library(skimr)
library(tidyquant)
library(broom)
library(purrr)
library(rvest)
library(tidycensus)
library(knitr)
library(leaflet)
library(usmap)
library(reticulate)
library(modeltime)
library(tictoc)
library(future)
library(doFuture)
library(timetk)
library(thief)
registerDoFuture()
# use all available cores for parallel processing
n_cores <- parallel::detectCores()
plan(strategy = cluster,
     workers = parallel::makeCluster(n_cores))
api_key <- Sys.getenv("CENSUS_API_KEY")
To gather the large amount of data on the GVA website, we can employ web scraping to read the HTML and turn it into a tidy tabular format for analysis. I originally began this project as a way to learn web scraping, and the project evolved into a desire to learn even more as I explored the data.
The GVA URLs were different by year, and they even changed during the course of this project. I decided to break the scraping up by year as that is how the website is organized to some extent.
# 2014
# create a tibble with a column of the website page URLs that we want to scrape
ms_pages_2014 <-
  tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shootings/2014",
                   str_c("https://www.gunviolencearchive.org/reports/mass-shootings/2014?page=", c(1:10))))
# use the purrr package to iterate over the column of URLs with read_html() from the xml2 package
ms_scrape_2014 <- ms_pages_2014 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
# convert the html into a table, again using purrr's map function to iterate over the scraped data
ms_2014_tbl <- map_df(ms_scrape_2014$scrape, ~ html_table(.x))
# repeat the same pattern for the remaining years
# 2015
ms_pages_2015 <-
  tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shootings/2015",
                   str_c("https://www.gunviolencearchive.org/reports/mass-shootings/2015?page=", c(1:13))))
ms_scrape_2015 <- ms_pages_2015 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
ms_2015_tbl <- map_df(ms_scrape_2015$scrape, ~ html_table(.x))
# 2016
ms_pages_2016 <-
  tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2016",
                   str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=", c(1:15), "&year=2016")))
ms_scrape_2016 <- ms_pages_2016 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
ms_2016_tbl <- map_df(ms_scrape_2016$scrape, ~ html_table(.x))
# 2017
ms_pages_2017 <-
  tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2017",
                   str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=", c(1:13), "&year=2017")))
ms_scrape_2017 <- ms_pages_2017 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
ms_2017_tbl <- map_df(ms_scrape_2017$scrape, ~ html_table(.x))
# 2018
ms_pages_2018 <-
  tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2018",
                   str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=", c(1:13), "&year=2018")))
ms_scrape_2018 <- ms_pages_2018 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
ms_2018_tbl <- map_df(ms_scrape_2018$scrape, ~ html_table(.x))
# 2019
ms_pages_2019 <-
  tibble(pages = c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2019",
                   str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=", c(1:16), "&year=2019")))
ms_scrape_2019 <- ms_pages_2019 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
ms_2019_tbl <- map_df(ms_scrape_2019$scrape, ~ html_table(.x))
# 2019 through 2022
ms_pages_2019_2022 <-
  tibble(pages = c("https://www.gunviolencearchive.org/mass-shooting",
                   str_c("https://www.gunviolencearchive.org/mass-shooting?page=", c(1:79))))
ms_scrape_2019_2022 <- ms_pages_2019_2022 %>%
  mutate(scrape = map(.x = pages, .f = ~read_html(.x)))
ms_2019_2022_tbl <- map_df(ms_scrape_2019_2022$scrape, ~ html_table(.x))
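Since every year follows the same scrape-then-bind pattern, the repeated blocks above could be factored into a small helper. This is just a sketch (the function name scrape_gva_pages is mine, not part of the original code):

```r
library(tidyverse)
library(rvest)

# scrape each page URL and row-bind the resulting HTML tables
scrape_gva_pages <- function(page_urls) {
  map_df(page_urls, ~ html_table(read_html(.x)))
}

# e.g. the 2016 report pages, which follow a ?page=N&year=YYYY pattern
urls_2016 <- c("https://www.gunviolencearchive.org/reports/mass-shooting?year=2016",
               str_c("https://www.gunviolencearchive.org/reports/mass-shooting?page=", 1:15, "&year=2016"))
# ms_2016_tbl <- scrape_gva_pages(urls_2016)
```

The per-year blocks would then differ only in how their URL vectors are built.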
Once the different years have been collected, we can stack the scraped datasets and store the result locally so that we don't need to keep calling the website.
complete_tbl <- ms_2014_tbl %>%
  bind_rows(ms_2015_tbl) %>%
  bind_rows(ms_2016_tbl) %>%
  bind_rows(ms_2017_tbl) %>%
  bind_rows(ms_2018_tbl) %>%
  bind_rows(ms_2019_tbl) %>%
  bind_rows(ms_2019_2022_tbl) %>%
  distinct()
write_rds(complete_tbl, "gva_data.rds")
Now we can read in the saved data file locally.
complete_tbl <- read_rds("gva_data.rds")
Here I prepare the primary data set to include some date-related fields for convenience when plotting and summarizing the data.
ms_tbl <- complete_tbl %>%
  janitor::clean_names() %>%
  select(-operations) %>%
  mutate(incident_date = mdy(incident_date),
         year = year(incident_date),
         month = month(incident_date, label = TRUE, abbr = TRUE),
         incident_month = rollforward(incident_date),
         dow = wday(incident_date, label = TRUE)) %>%
  arrange(incident_date)
This is a function to group and summarize the data in different ways, since we will need to do so many times later. Functions are very helpful in R, particularly when you need to repeat the same operation.
group_function <- function(df, ...) {
  output <- df %>%
    group_by(...) %>%
    summarize(n_incidents = n(),
              n_deaths = sum(number_killed, na.rm = TRUE),
              n_injuries = sum(number_injured, na.rm = TRUE),
              .groups = "drop")
  return(output)
}
This is an initial look at a random sample of ten rows of the data. Each row represents one mass shooting incident. Each incident has its own unique ID, a date, a geographic location, and counts of injuries and deaths, along with the date-related fields I added above to assist with analysis.
ms_tbl %>%
sample_n(size = 10)
Below is a broad summary of the data. No data are missing. A few points stand out with regard to the day of the week and perhaps the month. Per incident, deaths range from zero to 59 and injuries from zero to 441.
skim(ms_tbl)
|                        |        |
|------------------------|--------|
| Name                   | ms_tbl |
| Number of rows         | 4043   |
| Number of columns      | 11     |
| Column type frequency: |        |
| character              | 3      |
| Date                   | 2      |
| factor                 | 2      |
| numeric                | 4      |
| Group variables        | None   |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 4 | 20 | 0 | 49 | 0 |
| city_or_county | 0 | 1 | 3 | 39 | 0 | 1039 | 0 |
| address | 0 | 1 | 3 | 54 | 0 | 3992 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| incident_date | 0 | 1 | 2014-01-01 | 2023-01-01 | 2019-10-27 | 2015 |
| incident_month | 0 | 1 | 2014-01-31 | 2023-01-31 | 2019-10-31 | 109 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| month | 0 | 1 | TRUE | 12 | Jul: 514, Jun: 482, Aug: 429, Sep: 383 |
| dow | 0 | 1 | TRUE | 7 | Sun: 1134, Sat: 962, Fri: 463, Mon: 399 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| incident_id | 0 | 1 | 1414230.22 | 721235.38 | 92194 | 756901 | 1537044 | 2053155 | 2492328 | ▅▅▅▆▇ |
| number_killed | 0 | 1 | 1.06 | 1.98 | 0 | 0 | 1 | 1 | 59 | ▇▁▁▁▁ |
| number_injured | 0 | 1 | 4.18 | 7.25 | 0 | 3 | 4 | 5 | 441 | ▇▁▁▁▁ |
| year | 0 | 1 | 2018.77 | 2.56 | 2014 | 2017 | 2019 | 2021 | 2023 | ▃▅▅▇▃ |
ms_tbl %>%
group_function() %>%
mutate(inj_death_ratio = n_injuries / n_deaths,
death_inc_ratio = n_deaths / n_incidents,
inj_inc_ratio = n_injuries / n_incidents)
Here we see a couple of patterns. First, the volume of mass shootings is increasing. Second, there appears to be a seasonal component to when mass shootings occur, meaning mass shooting volume follows a consistent pattern tied to the month of the year.
ms_tbl %>%
filter(incident_month < floor_date(Sys.Date(), "month")) %>%
group_function(incident_month) %>%
ggplot(aes(incident_month, n_incidents)) +
geom_col(fill = "steelblue", color = "gray30", alpha = 0.9) +
geom_smooth(method = "loess", se = FALSE, color = "black") +
scale_x_date(date_labels = "%Y", date_breaks = "1 year") +
theme_tq() +
scale_color_tq() +
labs(title = "Number of Mass Shooting Incidents by Month and Year",
subtitle = str_glue("From {min(ms_tbl$incident_date)} through {max(ms_tbl %>% filter(incident_month < floor_date(Sys.Date(), 'month')) %>% pull(incident_date))}"),
x = "Month",
y = "",
caption = "current incomplete month removed")
How much have mass shooting incidents, deaths, and injuries been changing over time? To find out, we can compute annual rates of change as a percentage increase or decrease from the prior year, and later summarize them with a geometric mean.
growth_rates <- ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(year) %>%
mutate(across(c(n_incidents:n_injuries), ~. / lag(.), .names = "{.col}_delta")) %>%
drop_na()
gr_table <- growth_rates %>%
select(year, contains("delta")) %>%
mutate(across(contains("delta"), ~. - 1))
gr_table %>%
mutate(across(contains("delta"), ~percent(., accuracy = 0.1))) %>%
gt::gt()
| year | n_incidents_delta | n_deaths_delta | n_injuries_delta |
|---|---|---|---|
| 2015 | 23.1% | 34.2% | 23.1% |
| 2016 | 14.0% | 22.8% | 15.1% |
| 2017 | -9.1% | -2.6% | 17.4% |
| 2018 | -3.4% | -15.6% | -26.4% |
| 2019 | 24.1% | 25.0% | 28.7% |
| 2020 | 46.3% | 10.3% | 48.4% |
| 2021 | 13.1% | 37.4% | 11.3% |
| 2022 | -6.1% | -4.7% | -4.3% |
This chart shows how the volume of mass shootings and the resulting deaths and injuries have either increased or decreased over the years. Anything above zero is an increase from the prior year, and anything below zero is a decrease.
gr_table %>%
pivot_longer(cols = contains("delta")) %>%
ggplot(aes(year, value, color = name)) +
geom_segment(aes(x = year, xend = year, y = 0, yend = value)) +
geom_point(size = 5) +
geom_text(aes(label = percent(value, accuracy = 0.1)), size = 3, color = "black", vjust = -1.2) +
geom_hline(yintercept = 0, color = "gray30") +
expand_limits(x = c(2014.5, 2022.5), y = c(-0.25, 0.5)) +
scale_y_continuous(labels = label_percent(accuracy = 1)) +
theme_tq() +
scale_color_tq() +
facet_grid(~fct_inorder(name)) +
labs(title = "Annual Percentage Change in Volume of Mass Shootings, Deaths, and Injuries",
x = "",
y = "% Change From Prior Year",
caption = "current incomplete year removed") +
theme(legend.position = "none",
plot.title = element_text(size = 12))
This is the average growth rate over the entire period. Effectively, you can read this summary as the average annual increase in the number of incidents, deaths, and injuries since 2014. On average, mass shooting incidents, deaths, and injuries have each grown at a rate of 11% to 12% per year since 2014.
gr_table %>%
mutate("Start Year" = min(year)-1,
"End Year" = max(year)) %>%
group_by(`Start Year`, `End Year`) %>%
summarize(across(contains("delta"), ~exp(mean(log(. + 1))) - 1),
.groups = "drop") %>%
mutate(across(contains("delta"), ~percent(., accuracy = 0.1))) %>%
pivot_longer(cols = contains("delta"), names_to = "Type", values_to = "Growth Rate") %>%
gt::gt()
| Start Year | End Year | Type | Growth Rate |
|---|---|---|---|
| 2014 | 2022 | n_incidents_delta | 11.4% |
| 2014 | 2022 | n_deaths_delta | 11.8% |
| 2014 | 2022 | n_injuries_delta | 12.1% |
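As a check on the table above, the 11.4% figure for incidents can be reproduced directly from the rounded year-over-year deltas (rounded inputs, so the match is only approximate):

```r
# year-over-year changes in incident counts, 2015 through 2022 (from the delta table above)
incident_deltas <- c(0.231, 0.140, -0.091, -0.034, 0.241, 0.463, 0.131, -0.061)

# geometric mean of the growth factors, expressed as an average annual change
exp(mean(log(1 + incident_deltas))) - 1
# approximately 0.114, i.e. ~11.4% per year
```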
How many mass shootings do we see each year?
ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(year) %>%
ggplot(aes(year, n_incidents, fill = n_incidents)) +
geom_col(alpha = 0.6) +
geom_text(aes(label = n_incidents), nudge_y = -20) +
theme_tq() +
scale_fill_gradient(high = "red", low = "blue") +
scale_x_continuous(breaks = seq(min(ms_tbl$year), max(ms_tbl$year), by = 1)) +
labs(title = "Number of Mass Shootings by Year",
subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
x = "Year",
y = "",
caption = "current incomplete year removed") +
theme(legend.position = "none")
Which months tend to have more or fewer mass shootings?
ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(month) %>%
ggplot(aes(month, n_incidents, fill = n_incidents)) +
geom_col(alpha = 0.6) +
geom_text(aes(label = n_incidents), nudge_y = 15) +
theme_tq() +
scale_fill_gradient(high = "red", low = "blue") +
labs(title = "Number of Mass Shootings by Month (Year Over Year)",
subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
x = "",
y = "",
caption = "current incomplete year removed") +
theme(legend.position = "none")
Follow-up question: what proportion of shootings happen in each season (roughly approximated by the season's starting month)?
ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(incident_id, month) %>%
select(month:n_incidents) %>%
summarize(n = n(),
winter = sum(n_incidents[month %in% c("Dec","Jan","Feb")]),
spring = sum(n_incidents[month %in% c("Mar","Apr","May")]),
summer = sum(n_incidents[month %in% c("Jun","Jul","Aug")]),
fall = sum(n_incidents[month %in% c("Sep","Oct","Nov")]),
.groups = "drop") %>%
mutate(spring_prop = spring / n,
summer_prop = summer / n,
fall_prop = fall / n,
winter_prop = winter / n) %>%
select(contains("prop")) %>%
mutate(across(everything(), ~ percent(., accuracy = 0.1)))
Mass shootings are approximately 10% more common in the summer months and approximately 10% less common in the winter when compared to spring and fall months.
Which days of the week tend to have more or fewer mass shootings?
ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(dow) %>%
ggplot(aes(dow, n_incidents, fill = n_incidents)) +
geom_col(alpha = 0.6) +
geom_text(aes(label = n_incidents), nudge_y = 30) +
theme_tq() +
scale_fill_gradient(high = "red", low = "blue") +
labs(title = "Number of Mass Shootings by Weekday (Year Over Year)",
subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
x = "",
y = "",
caption = "current incomplete year removed") +
theme(legend.position = "none")
Follow-up question: what proportion of mass shootings occur on weekends?
ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(incident_id, dow) %>%
select(dow:n_incidents) %>%
summarize(n = n(),
sat_sun = sum(n_incidents[dow %in% c("Sat","Sun")]),
.groups = "drop") %>%
transmute(weekend_prop = sat_sun / n,
weekend_prop = percent(weekend_prop, accuracy = 0.1))
About 51% of mass shootings occur on weekends.
Here is one more visualization to help us see the number of mass shootings by day of the week and the month in which they occur. Each row of bars represents 100% of the shootings in that month, split across the days of the week.
ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(month, dow) %>%
select(month:n_incidents) %>%
ggplot(aes(fct_rev(month), n_incidents, fill = dow)) +
geom_col(position = position_fill(reverse = TRUE), alpha = 0.8) +
coord_flip() +
scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
theme_tq() +
labs(title = "Proportion of Mass Shootings by Month and Weekday, Year Over Year",
subtitle = str_glue("From {min(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))} through {max(ms_tbl %>% filter(year < year(Sys.Date())) %>% pull(incident_date))}"),
x = "",
y = "",
caption = "current incomplete year removed",
fill = "Day of Week") +
guides(fill = guide_legend(nrow = 1, byrow = TRUE))
We have seen that some days of the week and some months have higher numbers of mass shootings than others. That raises questions.
Is there a significant difference in the number of mass shootings by day of the week? We can use a chi-squared goodness-of-fit test, in which we test the null hypothesis that the proportion of mass shootings is the same for every day of the week. In other words, each day should have proportionally the same volume of mass shootings.
chi_dow_tbl <- ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(incident_id, dow) %>%
select(dow:n_incidents)
chi_dow_tbl %>% table()
## n_incidents
## dow 1
## Sun 1132
## Mon 399
## Tue 340
## Wed 385
## Thu 360
## Fri 463
## Sat 962
chisq_test(chi_dow_tbl,
           response = dow,
           p = c("Sun" = 1/7, "Mon" = 1/7, "Tue" = 1/7, "Wed" = 1/7,
                 "Thu" = 1/7, "Fri" = 1/7, "Sat" = 1/7))
The answer here is yes: there is a significant difference in the volume of mass shooting incidents across the days of the week at the p < 0.05 level.
Is there a significant difference between months in terms of the number of mass shooting incidents?
chi_month_tbl <- ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(incident_id, month) %>%
select(month:n_incidents)
chi_month_tbl %>% table()
## n_incidents
## month 1
## Jan 223
## Feb 216
## Mar 231
## Apr 301
## May 380
## Jun 482
## Jul 514
## Aug 429
## Sep 383
## Oct 345
## Nov 304
## Dec 233
chisq_test(chi_month_tbl,
           response = month,
           p = c("Jan" = 1/12, "Feb" = 1/12, "Mar" = 1/12, "Apr" = 1/12,
                 "May" = 1/12, "Jun" = 1/12, "Jul" = 1/12, "Aug" = 1/12,
                 "Sep" = 1/12, "Oct" = 1/12, "Nov" = 1/12, "Dec" = 1/12))
Yes, the distribution of incidents across months differs significantly from uniform at the p < 0.05 level.
Is there a statistically significant relationship between the day of the week when a mass shooting occurs and the month in which the shooting happens?
We can use a chi-squared test of independence to test our question.
chi_tbl <- ms_tbl %>%
filter(year(incident_date) < year(Sys.Date())) %>%
group_function(incident_id, month, dow) %>%
select(month:n_incidents)
chi_tbl %>%
table()
## , , n_incidents = 1
##
## dow
## month Sun Mon Tue Wed Thu Fri Sat
## Jan 61 23 17 26 23 29 44
## Feb 56 18 20 22 19 23 58
## Mar 66 19 17 20 16 36 57
## Apr 89 23 22 28 25 34 80
## May 111 37 35 40 25 39 93
## Jun 135 43 46 44 46 48 120
## Jul 139 66 35 53 56 49 116
## Aug 132 43 38 44 29 43 100
## Sep 111 44 36 39 31 32 90
## Oct 90 32 26 26 41 40 90
## Nov 83 30 23 29 28 42 69
## Dec 59 21 25 14 21 48 45
chisq_test(x = chi_tbl, formula = dow ~ month)
The answer is that we don't have enough evidence to reject the null hypothesis, which asserts that the day of the week on which a mass shooting occurs is independent of the month. Sundays in Jun, Jul, and Aug stand out with higher volumes, but the association between month and day of the week is not statistically significant at the p < 0.05 level.
This is interesting given that the day-of-week and month tests were each significant on their own, yet jointly there is no evidence of an association between them.
We noted before that there are seasonal trends in the data, so I am going to bring those forward using moving averages, which are simply averages of mass shooting volume over a given window of time. In this case, I will use two windows: a six month moving average to highlight the seasonality, and a 12 month moving average to highlight the trend of increasing numbers of mass shootings. We appear to have a new post-pandemic normal.
ma_tbl <- ms_tbl %>%
filter(incident_date < floor_date(Sys.Date(), "month")) %>%
arrange(incident_date) %>%
group_function(incident_month)
ma_tbl %>%
ggplot(aes(incident_month, n_incidents)) +
geom_col(fill = "steelblue", color = "gray30", alpha = 0.4) +
geom_line(data = ma_tbl %>%
select(1,2) %>%
mutate(ma_6 = rollmean(n_incidents, k = 6, fill = NA, align = "right")),
aes(incident_month, ma_6), color = "#F44336", size = 1.5) +
geom_line(data = ma_tbl %>%
select(1,2) %>%
mutate(ma_12 = rollmean(n_incidents, k = 12, fill = NA, align = "right")),
aes(incident_month, ma_12), color = "black", size = 1.5)+
theme_tq() +
scale_color_tq() +
labs(title = "Mass Shooting Monthly Incident Count Moving Averages",
subtitle = "Red line is the 6 month moving average, black line is the 12 month moving average",
x = "",
y = "Number of Mass Shooting Incidents",
caption = "current incomplete month removed")
The increasing trend is now easier to see in the twelve month moving average (the black line), as is the seasonal monthly pattern in the six month moving average (the red line). You can see the red line move above and below the black line, clearly revealing the seasonal component of mass shootings.
Additionally, we can see that since 2020 the variability in the number of mass shootings has increased, with the red line swinging much higher than in previous years. That is expected given the increase in volume, but the change in the overall behavior of the system is worth noting. Further, while 2020 was an aberrant year given the pandemic, and assuming a relationship between the pandemic's impact on US society and mass shooting behavior, the prevalence has not returned to pre-pandemic levels now that we are in a post-pandemic environment.
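One way to quantify that change in variability is a rolling standard deviation of the monthly counts. Here is a sketch using the ma_tbl built above; the zoo::rollapply call and the pre/post-2020 split are my additions, not part of the original analysis:

```r
library(dplyr)
library(zoo)

# 12-month rolling standard deviation of monthly incident counts
volatility_tbl <- ma_tbl %>%
  mutate(sd_12 = rollapply(n_incidents, width = 12, FUN = sd,
                           fill = NA, align = "right"))

# compare typical month-to-month variability before and after 2020
volatility_tbl %>%
  mutate(period = if_else(incident_month < as.Date("2020-01-01"),
                          "pre-2020", "2020 onward")) %>%
  group_by(period) %>%
  summarize(median_sd_12 = median(sd_12, na.rm = TRUE))
```

A clearly higher median rolling standard deviation in the later period would support the visual impression from the chart.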
This next time series takes the monthly incident counts we have already seen and lays out death and injury volume by month alongside them.
ms_tbl %>%
filter(incident_month < floor_date(Sys.Date(), "month")) %>%
group_function(incident_month) %>%
rename("Incidents" = n_incidents,"Deaths" = n_deaths,"Injuries" = n_injuries) %>%
pivot_longer(cols = -1) %>%
mutate(name = factor(name, levels = c("Incidents","Injuries","Deaths"))) %>%
ggplot(aes(incident_month, value, group = name, color = name)) +
geom_line() +
geom_point() +
geom_smooth(method = "loess", se = FALSE) +
scale_x_date(date_labels = "%Y", date_breaks = "3 years") +
expand_limits(y = 0) +
theme_tq() +
scale_color_tq() +
facet_wrap(~ name, scales = "free_y") +
labs(title = "Number of Mass Shooting Incidents by Month, Injuries and Deaths",
subtitle = str_glue("From {min(ms_tbl$incident_date)} through {max(ms_tbl %>% filter(incident_month < floor_date(Sys.Date(), 'month')) %>% pull(incident_date))}"),
x = "",
y = "") +
theme(legend.position = "none")
Is there a state that has not had a mass shooting?
us_states <- tibble("state" = state.name) %>%
bind_cols(tibble("state_code" = state.abb)) %>%
bind_rows(tibble(state = "District of Columbia",
state_code = "DC")) %>%
arrange(state)
ms_states_tbl <- ms_tbl %>%
inner_join(us_states, by = "state") %>%
distinct(state_code)
no_mass_shooting_states <- tibble("state_code" = state.abb,
"state_name" = state.name) %>%
anti_join(ms_states_tbl, by = "state_code") %>%
pull(state_name)
state_count <- length(no_mass_shooting_states)
Yes, two states have not had a mass shooting since collection of these data began in 2014: Hawaii and North Dakota.
Let’s look at the number of mass shootings by state. This gives us a look at totals.
ms_tbl %>%
group_function(state) %>%
inner_join(us_states, by = "state") %>%
ggplot(aes(fct_reorder(state, n_incidents), n_incidents)) +
geom_col(fill = "steelblue", color = "gray30", alpha = 0.8) +
geom_text(aes(label = n_incidents), hjust = -0.1, size = 2.5) +
scale_y_continuous(limits = c(0, 450), expand = c(0, 0)) +
coord_flip() +
theme_tq() +
labs(title = "States Ordered by Number of Mass Shootings",
subtitle = str_glue("{min(ms_tbl$incident_date)} through {max(ms_tbl$incident_date)}"),
x = "",
y = "Mass Shootings") +
theme(axis.text.y = element_text(size = 6))
state_stats <- ms_tbl %>%
group_function(state) %>%
inner_join(us_states, by = "state") %>%
mutate(incidents_rank = min_rank(desc(n_incidents)))
top_five_states <- state_stats %>%
filter(incidents_rank %in% c(1:5)) %>%
arrange(incidents_rank) %>%
pull(state)
top_states_for_time_series <- state_stats %>%
filter(incidents_rank %in% c(1:4)) %>%
arrange(incidents_rank) %>%
pull(state)
The five states with the greatest number of mass shootings are: Illinois, California, Texas, Florida, and Pennsylvania.
We should consider how likely it is, proportionally, to encounter a mass shooting (or to be injured or killed in one) instead of just looking at totals, because states' populations differ widely. In other words, given the population of a state, which states have the greatest number of mass shootings (or deaths or injuries) per person? I will use the 2020 US decennial census population data by state and the number of incidents, injuries, and deaths by state in 2022. I am including Washington, D.C. as a state. Another approach would be to sum all of the years against the 2020 population, but I elected to simply use the most recent complete year on record.
Which ten states have the most mass shootings, deaths, or injuries per capita?
us_state_pop_tbl <- get_decennial(geography = "state",
                                  variables = "P1_001N",
                                  year = 2020)
per_capita_base <- 1e5
per_capita_tbl <- ms_tbl %>%
  filter(year == 2022) %>%
  group_function(state) %>%
  inner_join(us_state_pop_tbl %>%
               select("state" = NAME, "population" = value),
             by = "state") %>%
  mutate(incidents_per_capita = n_incidents / (population / per_capita_base),
         deaths_per_capita = n_deaths / (population / per_capita_base),
         injuries_per_capita = n_injuries / (population / per_capita_base))
per_capita_tbl %>%
  select(state, contains("per_capita")) %>%
  pivot_longer(cols = contains("per_capita")) %>%
  mutate(name = fct_relevel(name, c("incidents_per_capita", "deaths_per_capita", "injuries_per_capita"))) %>%
  group_by(name) %>%
  slice_max(order_by = value, n = 10) %>%
  ggplot(aes(reorder_within(x = state, by = value, within = name, fun = sum), value, fill = name)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = comma(value, accuracy = 0.1)), size = 2, hjust = 1.1, color = "white") +
  coord_flip() +
  facet_wrap(~ name, scales = "free") +
  scale_x_reordered() +
  scale_y_continuous(label = label_comma(accuracy = 0.1)) +
  theme_tq() +
  scale_fill_tq() +
  labs(title = "Per Capita Mass Shootings, Deaths, and Injuries by US State in 2022",
       x = "",
       y = "Per 100,000 People") +
  theme(axis.text.y = element_text(size = 6),
        legend.position = "none")
Which cities tend to see the most mass shooting incidents since 2014? Are they changing over time?
Twelve may seem like an odd number to choose, but it happens to plot well in a grid.
top_cities <- ms_tbl %>%
filter(year < year(Sys.Date())) %>%
group_by(state, city_or_county) %>%
summarize(number_incidents = n(),
number_killed = sum(number_killed),
number_injured = sum(number_injured),
.groups = "drop") %>%
slice_max(order_by = number_incidents, n = 12) %>%
distinct(state, city_or_county) %>%
transmute(city_list = str_c(city_or_county, state, sep = ", "))
top_cities_incidents <- ms_tbl %>%
mutate(city_list = str_c(city_or_county, state, sep = ", ")) %>%
filter(year < year(Sys.Date()),
city_list %in% top_cities$city_list) %>%
group_by(state, city_or_county, year) %>%
summarize(number_incidents = n(),
number_killed = sum(number_killed),
number_injured = sum(number_injured),
.groups = "drop") %>%
complete(year, nesting(state, city_or_county),
fill = list(number_incidents = 0, number_killed = 0, number_injured = 0)) %>%
ungroup()
top_cities_incidents %>%
ggplot(aes(year, number_incidents, fill = city_or_county)) +
geom_col(alpha = 0.9) +
geom_text(aes(label = number_incidents), vjust = 1.5, size = 2, color = "white") +
facet_wrap(~ fct_reorder(str_c(city_or_county, state, sep = ", "), -number_incidents, sum), scales = "free_y") +
expand_limits(y = 0) +
theme_tq() +
scale_fill_tq() +
scale_y_continuous(labels = label_comma(accuracy = 1)) +
labs(x = "",
y = "",
title = str_glue("Top {nrow(top_cities)} US Cities by Total Number of Mass Shooting Incidents"),
subtitle = "Cities are Sorted in Descending Order",
caption = "current incomplete year removed") +
theme(legend.position = "none",
strip.text.x = element_text(size = 6),
axis.text.x = element_text(size = 5),
axis.text.y = element_text(size = 5))
Which cities have had more than 20 mass shootings since 2014?
The number 20 is somewhat arbitrary, but when looking at the counts of incidents by city, 20 appears to be near an inflection point beyond which the prevalence of mass shootings in a given city is markedly higher. Ten would also have been a reasonable threshold to choose.
ranked_cities <- ms_tbl %>%
filter(year < year(Sys.Date())) %>%
inner_join(us_states, by = "state") %>%
group_by(city = str_c(city_or_county, state_code, sep = ", ")) %>%
summarize(number_incidents = n(),
number_killed = sum(number_killed),
number_injured = sum(number_injured),
min_year = min(year),
max_year = max(year),
.groups = "drop") %>%
arrange(desc(number_incidents)) %>%
mutate(rank = row_number(),
city = fct_reorder(city, number_incidents, max))
top_5_cities <- ranked_cities %>%
slice_min(order_by = rank, n = 5) %>%
pull(city)
ranked_cities %>%
ggplot(aes(rank, number_incidents)) +
geom_point(alpha = 0.9, shape = 1, color = "gray40", size = 1) +
geom_hline(yintercept = 20, color = "blue") +
scale_x_continuous(breaks = seq(0,1000,50)) +
scale_y_continuous(breaks = seq(0,300,20)) +
theme_tq() +
labs(x = "City Rank",
y = "Number of Incidents",
title = str_glue("US Cities Mass Shooting Incidents Counts by City Rank"),
subtitle = str_glue("From {min(ranked_cities$min_year)} to {max(ranked_cities$max_year)}"))
Here are those cities with 20 or more mass shootings since 2014.
ranked_cities %>%
filter(number_incidents >= 20) %>%
ggplot(aes(city, number_incidents, fill = number_incidents)) +
geom_col(alpha = 0.9) +
geom_text(aes(label = number_incidents), size = 2, color = "white", hjust = 1.2) +
coord_flip() +
scale_fill_gradient(low = "blue2", high = "red4") +
scale_y_continuous(expand = c(0,0)) +
theme_tq() +
labs(x = "",
y = "",
title = str_glue("US Cities with 20 or More Mass Shooting Incidents"),
subtitle = str_glue("From {min(ranked_cities$min_year)} to {max(ranked_cities$max_year)}")) +
theme(legend.position = "none",
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 6))
The initial reason I started this project was to work on a web scraping endeavor. Scraping the initial mass shooting data was straightforward, and I learned a great deal about how to accomplish that process in R.
However, as things progressed and I wanted to mine more of the incident data, I discovered that some pages of the GVA website have a bot-checking mechanism in place to verify that a browser is likely driven by a human and not a bot. There were several helpful guides online about using Selenium as a headless browser, which I attempted with the RSelenium package. The website still recognized that I was using a controlled browser, and it denied my access.
Eventually, I discovered that I needed to adjust some of the Chrome options, and from the many solutions I found online, it became clear I could not make that happen in R and would need to switch to Python. I did just that, and it was a solid learning experience, as I have been working on enhancing my Python skills and this was a practical avenue for that objective.
The following took a few weeks of research, and a lot of late nights and weekends went into this effort. In all, it was worth it, but I did at times question pursuing it given the level of effort needed just to scrape these pages. I really wanted the best latitude and longitude information, and an earlier attempt to get coordinates by geocoding the addresses fell short because many of the addresses are missing detailed information. GVA seems to have better coordinate data behind the scenes.
I started by pulling the incident IDs from the GVA data, which become part of the URL for the incident-specific information. This was from the original data pull for all incidents; after gathering all of the historical data, I only need to update with new incidents moving forward.
incident_ids <- ms_tbl %>%
distinct(incident_id) # get a list of all of the incident IDs
geo_locations_incidents_tbl <- incident_ids %>%
mutate(pages = str_glue("https://www.gunviolencearchive.org/incident/{incident_id}"))
Here we only need to evaluate the new incidents and then look up their geolocations.
ms_tbl_updated_geo_locs <- read_rds("ms_tbl_updated_geo_locs.rds")
new_incident_ids <- ms_tbl %>%
anti_join(ms_tbl_updated_geo_locs, by = "incident_id") %>%
distinct(incident_id)
new_geo_locations_incidents_tbl <- new_incident_ids %>%
mutate(pages = str_glue("https://www.gunviolencearchive.org/incident/{incident_id}"))
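The anti_join above is essentially a set difference on incident IDs. For readers following the later Python portion, here is a hedged sketch of the same incremental-update idea with hypothetical IDs:

```python
# Hypothetical incident IDs for illustration only.
all_incident_ids = {882422, 900001, 900002, 900003}
already_scraped = {882422, 900001}

# Set difference plays the role of dplyr::anti_join on incident_id.
new_ids = all_incident_ids - already_scraped
urls = [f"https://www.gunviolencearchive.org/incident/{i}" for i in sorted(new_ids)]
```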
Now we can scrape using Python, which gives us more flexibility than RSelenium did (at least as far as I was able to discern).
# run these in terminal
# pip install selenium
# pip install chromedriver-autoinstaller
# pip install beautifulsoup4
# pip install pandas
from selenium import webdriver
import chromedriver_autoinstaller
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
from time import sleep
import pandas as pd
chromedriver_autoinstaller.install()
options = Options() # these are the options needed to hide that the browser was being controlled by a computer and not a human
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument("--profile-directory=Default")
options.add_experimental_option("useAutomationExtension", False)
driver = webdriver.Chrome(options=options) # this establishes the browser session
first_page = "https://www.gunviolencearchive.org/incident/882422" # this is a default page to start with
driver.get(first_page) # this opens the URL in the established browser
# from here, I had to click on a second tab manually, and open another page on the same website which cleared the bot checker I was running into
# then, I was able to run the loop as follows
url_list = r.new_geo_locations_incidents_tbl # creates a dataframe in python from the R dataframe created earlier
temp_list = {
"geo_data": []
} # a dict holding an empty list that will collect the page text as we iterate over the pages
for url in url_list["pages"]: # loop to open the pages, parse the HTML, pull the text from the page, pop it into a list, rinse, and repeat
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
geo_loc = soup.get_text()
temp_list["geo_data"].append(geo_loc)
df = pd.DataFrame.from_dict(temp_list) # this moves the list into a dataframe
driver.close() # this closes the session.
# note that I was not able to iterate over all 4,000 pages in one go, as the bot detector blocked my VPN IPs a few times
# I broke the chunks down into 200 to 300 at a time, and that seemed to do the trick
# I probably could have used a sleep command to slow the roll, which might have appeased the bot police....
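The chunk-and-sleep approach described in the comments can be sketched as a small wrapper. The fetch function below is a stand-in for the driver.get plus BeautifulSoup steps above, and the chunk size and delay are guesses rather than values tested against GVA's rate limits:

```python
from time import sleep

def fetch(url):
    # Stand-in for driver.get(url) followed by BeautifulSoup parsing.
    return f"page text for {url}"

def scrape_in_chunks(urls, chunk_size=250, delay_seconds=2.0):
    """Scrape URLs in chunks, pausing between requests to look less bot-like."""
    results = []
    for start in range(0, len(urls), chunk_size):
        for url in urls[start:start + chunk_size]:
            results.append(fetch(url))
            sleep(delay_seconds)  # slow the roll between requests
        # a longer pause (or a manual breather) between chunks could go here
    return results
```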
Pull the Python object into R, and back up the scraped data in case R crashes along the way.
geo_data <- py$df # this command moves a Python object into an R object...Reticulate is pretty sweet!
write_rds(geo_data, "scraped_geo_data.rds") # I stored the data as I went in case R crashed
Update the geolocations file with the new records and save it.
geo_data <- read_rds("scraped_geo_data.rds")
output <- geo_data %>%
transmute(geo_data = str_extract(geo_data, "Geolocation.*")) # pull the geo location data
ms_tbl_updated <- ms_tbl %>%
anti_join(ms_tbl_updated_geo_locs, by = "incident_id") %>%
bind_cols(output) # combine the original data with the new geo location data
tbl_to_update <- read_rds("ms_tbl_updated_geo_locs.rds") # use this to source a previously stored table and append it
ms_tbl_updated <- tbl_to_update %>%
bind_rows(ms_tbl_updated) # append the new data rows to the old
write_rds(ms_tbl_updated, "ms_tbl_updated_geo_locs.rds") # save it
ms_tbl_updated %>%
summarize(n = n(),
n_dist = n_distinct(incident_id)) # check it
ms_tbl_map_data <- read_rds("ms_tbl_updated_geo_locs.rds")
ms_tbl_map_prep <- ms_tbl_map_data %>%
rename("GeoLocation" = geo_data) %>%
mutate(temp = GeoLocation,
temp = str_remove_all(temp, "Geolocation: ")) %>%
separate(temp, into = c("latitude","longitude"), sep = ", ") %>%
mutate(across(latitude:longitude, ~ as.numeric(.)),
color_year = factor(year),
size = number_killed + number_injured,
date_formatted = format.Date(incident_date, format = "%b %d, %Y"),
label = str_glue("{city_or_county}, {state}
{date_formatted}
Injured: {number_injured} | Killed: {number_killed}"))
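The coordinate extraction above (strip the "Geolocation: " prefix, split on the comma, coerce to numeric) has a direct Python equivalent; the sample string below is hypothetical, not text from a real incident page:

```python
import re

def parse_geolocation(text):
    """Extract (latitude, longitude) from a 'Geolocation: lat, lon' snippet."""
    match = re.search(r"Geolocation:\s*(-?\d+\.?\d*),\s*(-?\d+\.?\d*)", text)
    if match is None:
        return None  # page had no usable coordinates
    return float(match.group(1)), float(match.group(2))

# Hypothetical scraped text for illustration.
coords = parse_geolocation("... Geolocation: 41.8781, -87.6298 ...")
```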
Leaflet creates an interactive map that lets you zoom in and out and also mouse over incidents to get more information.
ms_tbl_map_prep %>%
leaflet(height = 1000, width = 1000) %>%
addProviderTiles("Stamen.Toner", group = "Stamen.Toner") %>% # group needed for the layers control below
addProviderTiles("Esri", group = "Esri") %>%
addCircleMarkers(label = lapply(str_glue("{month(ms_tbl_map_prep$incident_date, label = TRUE, abbr = TRUE)} {year(ms_tbl_map_prep$incident_date)} <br/> {ms_tbl_map_prep$city_or_county}, {ms_tbl_map_prep$state} <br/> {ms_tbl_map_prep$number_killed} killed | {ms_tbl_map_prep$number_injured} wounded"),
htmltools::HTML), radius = 3, color = "blue", clusterOptions = markerClusterOptions()) %>%
setView(lng = -90, lat = 20, zoom = 3.25) %>%
addLayersControl(baseGroups = c("Stamen.Toner","Esri"))
worst_shootings <- ms_tbl_map_prep %>%
slice_max(order_by = size, n = 10) %>%
slice(1:10) # slice_max keeps ties, so trim to exactly ten rows
worst_transformed <- usmap_transform(worst_shootings, input_names = c("longitude", "latitude"), output_names = c("x", "y"))
set.seed(1984)
plot_usmap(fill = "goldenrod1", alpha = 0.8) +
geom_point(data = worst_transformed, aes(x = x, y = y, size = size, color = color_year, text = label)) +
ggrepel::geom_label_repel(data = worst_transformed,
aes(x = x, y = y, label = label),
size = 2,
color = "black",
min.segment.length = 0.1,
segment.colour = "blue3",
box.padding = 1,
arrow = arrow(length = unit(0.01, "npc"))) +
scale_color_tq() +
labs(title = str_glue("Map of the Ten Worst Mass Shooting Incidents Between {year(min(ms_tbl$incident_date))} and {year(max(ms_tbl$incident_date))}"),
subtitle = "Point size represents the total deaths and injuries per incident") +
theme(legend.position = "none")
Time series modeling is a form of machine learning that uses patterns in historical data to predict what might happen next. I will use data from some of the highest mass shooting states to build predictive time series models. Perhaps leaders in these states can build on such models to identify when things might get worse and then deploy community resources and programs to try to reduce the number of mass shootings.
I am planning on projecting out 12 months, and the following code sets up the data so that we can iterate over it with several different models and then pick the best one for the forecasts.
One of the coolest features of R is that you can create data frames within data frames, meaning that a column which typically holds one value per record can instead hold an entire data set. These are called nested data sets, and under the hood they are lists within lists: a data frame (or tibble) is itself a list of columns, so a list-column can contain whole tibbles. Nested data are super useful for keeping data tidy, particularly for iterative modeling of many kinds.
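For readers more at home in Python, the nesting idea has a close pandas analog: a column whose cells are themselves DataFrames. A minimal sketch with made-up monthly counts:

```python
import pandas as pd

# Made-up monthly counts for two states, for illustration only.
df = pd.DataFrame({
    "state": ["Illinois", "Illinois", "Texas", "Texas"],
    "month": ["2023-01", "2023-02", "2023-01", "2023-02"],
    "n_incidents": [12, 9, 7, 5],
})

# "Nest": one row per state, with that state's sub-table stored in a cell.
groups = df.groupby("state")
nested = pd.DataFrame({
    "state": list(groups.groups.keys()),
    "data": [g.reset_index(drop=True) for _, g in groups],
})

# "Unnest": concatenate the sub-tables back into one flat table.
flat = pd.concat(list(nested["data"]), ignore_index=True)
```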
A convenient aspect of the modeltime package is that it automatically holds out testing data for you, so you don't have to create separate training and testing sets yourself (not that doing so is complicated). This is especially nice with time series data, because the testing data must be the most recent observations, as opposed to non-time-series modeling where a random subset of the data is withheld during training.
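The key difference from a random split is that the holdout must be the most recent slice of the series. A hedged sketch of that rule with placeholder values:

```python
def time_series_split(values, horizon):
    """Hold out the most recent `horizon` observations as the test set."""
    if horizon >= len(values):
        raise ValueError("horizon must be smaller than the series length")
    return values[:-horizon], values[-horizon:]

# Placeholder monthly counts; the last 12 "months" become the test set.
monthly_counts = list(range(36))
train, test = time_series_split(monthly_counts, horizon=12)
```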
future_date <- 12
ms_model_state_data <- ms_tbl %>%
filter(incident_date < floor_date(Sys.Date(), "month"),
year(incident_date) >= 2020) %>%
group_function(state, incident_month) %>%
select(-n_injuries, -n_deaths) %>%
complete(state, incident_month, fill = list(n_incidents = 0)) %>%
filter(state %in% top_states_for_time_series)
nested_data_tbl <- ms_model_state_data %>%
group_by(state) %>%
modeltime::extend_timeseries(.id_var = state,
.date_var = incident_month,
.length_future = future_date) %>%
modeltime::nest_timeseries(.id_var = state,
.length_future = future_date) %>%
modeltime::split_nested_timeseries(.length_test = future_date)
This code sets up the model recipes and creates workflows that use those recipes and their instructions. I did not use any hyperparameter tuning in this project, but I would recommend it for a more in-depth application.
# xgboost recipe
rec_xgb <- recipe(n_incidents ~ ., extract_nested_train_split(nested_data_tbl)) %>%
step_timeseries_signature(incident_month) %>%
step_rm(incident_month) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE)
# arima recipe
rec_arima <- recipe(n_incidents ~ ., extract_nested_train_split(nested_data_tbl)) %>%
step_timeseries_signature(incident_month) %>%
step_zv(all_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE)
# xgboost models
wflw_xgb_1 <- workflow() %>%
add_model(boost_tree("regression", learn_rate = 0.35) %>% set_engine("xgboost")) %>%
add_recipe(rec_xgb)
wflw_xgb_2 <- workflow() %>%
add_model(boost_tree("regression", learn_rate = 0.50) %>% set_engine("xgboost")) %>%
add_recipe(rec_xgb)
# random forest model
wflw_rf_1 <- workflow() %>%
add_model(rand_forest("regression") %>% set_engine("ranger")) %>%
add_recipe(rec_xgb)
# arima model
wflw_arima_1 <- workflow() %>%
add_model(arima_boost("regression") %>% set_engine("arima_xgboost")) %>%
add_recipe(rec_arima)
# prophet model
wflw_prophet_1 <- workflow() %>%
add_model(prophet_boost("regression") %>% set_engine("prophet_xgboost")) %>%
add_recipe(rec_arima)
The following section fits the models to the data.
# 2.0 SCALE ----
parallel_start(16)
nested_modeltime_tbl <- nested_data_tbl %>%
modeltime_nested_fit(
model_list = list(
wflw_xgb_1,
wflw_xgb_2,
wflw_rf_1,
wflw_arima_1,
wflw_prophet_1),
control = control_nested_fit(
verbose = TRUE,
allow_par = TRUE))
write_rds(nested_modeltime_tbl, "nested_modeltime_tbl.rds")
Here is a look at the various models and how their predictions compare with the held-out actual data. Earlier, I elected to choose the top four states for this time series analysis simply due to plotting space. In reality, and this is one of the reasons the modeltime package is so powerful, you can run time series models across any number of states (or whatever grouping variable you are interested in). This makes light work of testing multiple models across multiple groups.
nested_modeltime_tbl <- read_rds("nested_modeltime_tbl.rds")
nested_modeltime_tbl %>%
extract_nested_test_forecast() %>%
group_by(state) %>%
plot_modeltime_forecast(.facet_ncol = 2)
We select the most accurate models based upon the mean absolute error (MAE) metric. MAE is the average absolute difference between a model's predictions and the actual values, and we select the model with the lowest error for each state.
nested_best_tbl <- nested_modeltime_tbl %>%
modeltime_nested_select_best(metric = "mae", minimize = TRUE)
write_rds(nested_best_tbl, "best_models.rds")
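Selecting by MAE boils down to a small computation. Here is a sketch with invented held-out actuals and two models' predictions (nothing here comes from the real fits):

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between predictions and actual values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Invented held-out actuals and two models' predictions.
actual = [10, 12, 9, 14]
predictions = {
    "xgboost": [11, 10, 9, 15],
    "prophet": [8, 13, 12, 13],
}

scores = {name: mean_absolute_error(actual, preds) for name, preds in predictions.items()}
best_model = min(scores, key=scores.get)  # lowest MAE wins
```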
Here is a final look at the best model’s predictions compared with the actual data that we held back.
nested_best_tbl <- read_rds("best_models.rds")
nested_best_tbl %>%
extract_nested_test_forecast() %>%
group_by(state) %>%
plot_modeltime_forecast(.facet_ncol = 2)
This is the final step in the modeling process, where we refit the most accurate models on the full data, including the observations we previously held back.
nested_best_refit_tbl <- nested_best_tbl %>%
modeltime_nested_refit(
control = control_refit(
verbose = TRUE,
allow_par = TRUE))
parallel_stop()
write_rds(nested_best_refit_tbl, "nested_best_refit_tbl.rds")
Here are the final future forecasts by state. The line represents the expected value of the forecast, and the gray band represents the 95% confidence interval around the prediction. No forecast will be exactly right, so the interval gives a reasonable range within which the true value is expected to fall.
nested_best_refit_tbl <- read_rds("nested_best_refit_tbl.rds")
nested_best_refit_tbl %>%
extract_nested_future_forecast() %>%
mutate(.conf_lo = case_when(.conf_lo < 0 ~ 0,
TRUE ~ .conf_lo)) %>%
group_by(state) %>%
plot_modeltime_forecast(.facet_ncol = 2)
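For intuition on the gray bands: under a normal approximation, a 95% interval is the point forecast plus or minus about 1.96 standard errors, floored at zero here because incident counts cannot be negative (mirroring the .conf_lo adjustment above). The numbers below are made up:

```python
def interval_95(point_forecast, std_error):
    """Normal-approximation 95% interval, floored at zero for count data."""
    half_width = 1.96 * std_error
    lower = max(0.0, point_forecast - half_width)
    upper = point_forecast + half_width
    return lower, upper

# Made-up forecast: 8 incidents expected, with a standard error of 3.
lo, hi = interval_95(8, 3)
```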
Mass shooting incidents have increased since 2014 at an average annual rate of roughly 11% to 12%. Since 2020, the number of mass shootings has risen further, and we appear to have settled into a new, higher 'normal' level of mass shootings in the US.
Saturdays and Sundays tend to have more mass shootings, accounting for about half of all incidents, while Monday through Friday account for the remaining half.
Summer months account for about 35% of mass shootings and winter months about 15%, with fall and spring each near 25%.
Only two of the fifty states have not had a mass shooting since 2014: Hawaii and North Dakota.
The top five states (with Washington, DC included in the ranking) with the most total mass shootings since 2014 are Illinois, California, Texas, Florida, and Pennsylvania.
Per capita, Washington, DC experiences the most mass shootings, injuries, and deaths; on a per-person basis, you are more likely to experience a mass shooting in DC than anywhere else in the country.
Illinois, which is highest in total volume, also makes the list of states with the most mass shootings, injuries, and deaths per capita.
The top five US cities with the most mass shooting incidents are Chicago, IL; Philadelphia, PA; Baltimore, MD; Houston, TX; and New Orleans, LA.
The time series forecasts suggest that these four states will continue to see regular mass shooting incidents, and potentially more as time goes on.